In [1]:
%matplotlib inline
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import os

Hands-on: Linear Regression - Complex Shapes

Test 1
Input Features: x
Output / Target: y_noisy
Objective: Underfitting demo

Test 2
Input Features: x, x^2
Output / Target: y_noisy
Objective: How adding relevant features improves prediction accuracy
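
As a rough local cross-check of the two tests (the actual models in this notebook are trained and evaluated with AWS Machine Learning), the same comparison can be sketched with numpy.polyfit. The helper name local_rmse_check and the in-order 200/60 split are illustrative assumptions, and the sketch relies on the df constructed in the cells below.

# Hedged sketch, not the AWS ML pipeline: fit y_noisy on x alone (degree 1, Test 1)
# vs. on x and x^2 (degree 2, Test 2) and compare RMSE on the held-out rows.
def local_rmse_check(df):
    train, test = df.iloc[:200], df.iloc[200:]
    c1 = np.polyfit(train.x, train.y_noisy, deg = 1)   # straight line -> underfits
    c2 = np.polyfit(train.x, train.y_noisy, deg = 2)   # quadratic -> matches the true shape
    rmse = lambda c: np.sqrt(np.mean((np.polyval(c, test.x) - test.y_noisy) ** 2))
    return rmse(c1), rmse(c2)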


In [2]:
def quad_func(x):
    return 5 * x ** 2 - 23 * x + 47

In [3]:
# Training Set + Eval Set: 200 samples (split 70% / 30% by AWS ML)
# Test Set: 60 samples
# Total: 260 samples
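
The 70% / 30% train/evaluation split is performed server-side by AWS ML when the datasource is created, so the exact rows it selects are not reproducible here. A minimal local approximation (an in-order split of the df built in the next cells, purely for illustration):

# Hedged sketch: mirrors the row counts only; AWS ML shuffles and splits internally.
train_eval = df.iloc[:200]                 # rows written to the training CSVs below
split_at = int(len(train_eval) * 0.7)      # 140 training rows
local_train = train_eval.iloc[:split_at]
local_eval = train_eval.iloc[split_at:]    # 60 evaluation rows
# Rows 200-259 never reach training; the batch-prediction files below contain all 260 rows.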

In [4]:
np.random.seed(5)
samples = 260
x_vals = pd.Series(np.random.rand(samples) * 20)
x2_vals = x_vals ** 2
y_vals = x_vals.map(quad_func)
y_noisy_vals = y_vals + np.random.randn(samples) * 50

In [5]:
df = pd.DataFrame({'x': x_vals, 
                   'x2': x2_vals ,
                   'y': y_vals, 
                   'y_noisy': y_noisy_vals})

In [6]:
df.head()


Out[6]:
           x          x2            y      y_noisy
0   4.439863   19.712387    43.445077    88.950606
1  17.414646  303.269900  1162.812637  1193.704875
2   4.134383   17.093124    37.374807    62.355709
3  18.372218  337.538400  1312.130983  1254.553770
4   9.768224   95.418196   299.421832   268.896012

In [7]:
df.corr()


Out[7]:
                x        x2         y   y_noisy
x        1.000000  0.968304  0.948299  0.940940
x2       0.968304  1.000000  0.997515  0.991770
y        0.948299  0.997515  1.000000  0.994777
y_noisy  0.940940  0.991770  0.994777  1.000000

In [8]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df['x'],
            y = df['y'],
            color = 'r',
            label = 'y')
plt.scatter(x = df['x'],
            y = df['y_noisy'],
            color = 'b',
            label = 'y noisy', 
            marker = '+')
plt.xlabel('x')
plt.ylabel('Target Attribute')
plt.grid(True)
plt.legend()


Out[8]:
<matplotlib.legend.Legend at 0x256c8271ac8>

In [9]:
# Raw string so the Windows-style backslashes are not interpreted as escape sequences
data_path = r'..\Data\RegressionExamples\quadratic'

In [10]:
df.to_csv(os.path.join(data_path,'quadratic_example_all.csv'),
          index = True,
          index_label = 'Row')

Training and Evaluation Set

Training Set 1: Row, x, y_noisy
Training Set 2: Row, x, x2, y_noisy


In [11]:
df[df.index < 200].to_csv(os.path.join(data_path, 'quadratic_example_train_underfit.csv'),
                          index = True,
                          index_label = 'Row', 
                          columns = ['x', 'y_noisy'])

In [12]:
df[df.index < 200].to_csv(os.path.join(data_path, 'quadratic_example_train_normal.csv'),
                          index = True,
                          index_label = 'Row',
                          columns= ['x', 'x2', 'y_noisy'])

In [13]:
df.to_csv(os.path.join(data_path, 'quadratic_example_test_all_underfit.csv'), 
          index = True,
          index_label = 'Row', 
          columns = ['x'])

In [14]:
df.to_csv(os.path.join(data_path, 'quadratic_example_test_all_normal.csv'),
          index = True,
          index_label = 'Row', 
          columns = ['x', 'x2'])

In [15]:
# Pull Predictions
# Prediction without quadratic term
df = pd.read_csv(os.path.join(data_path,'quadratic_example_all.csv'), 
                 index_col = 'Row')
df_predicted_underfit = pd.read_csv(os.path.join(data_path, 'output_underfit',
                                                 'bp-pNYIAR35aSV-quadratic_example_test_all_underfit.csv.gz'))
df_predicted_underfit.columns = ["Row", "y_predicted"]

In [16]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df.x,
            y = df.y_noisy,
            color = 'b',
            label = 'actual', 
            marker = '+')
plt.scatter(x = df.x,
            y = df_predicted_underfit.y_predicted ,
            color = 'g',
            label = 'Fit (x)',
            marker = '^')
plt.title('Quadratic - underfit')
plt.xlabel('x')
plt.ylabel('Target Attribute')
plt.grid(True)
plt.legend()


Out[16]:
<matplotlib.legend.Legend at 0x256c8459e10>

Test 1: Training RMSE: 385.18, Evaluation RMSE: 257.89, Baseline RMSE: 437.31
Wojciech's results: Training RMSE: 385.16, Evaluation RMSE: 257.898, Baseline RMSE: 437.311

The model's RMSE is large and close to the baseline RMSE, which indicates underfitting; a local check is sketched below.
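
As a hedged local check (the figures above come from the AWS ML evaluation on its internal 30% split, so an all-rows number will not match them exactly), RMSE can be computed directly against the batch-prediction output:

# Hedged sketch: RMSE of the underfit predictions and of a mean-only baseline,
# computed over all 260 rows as a sanity check rather than a reproduction.
resid_underfit = df.y_noisy.values - df_predicted_underfit.y_predicted.values
print('All-rows RMSE (underfit model):', np.sqrt(np.mean(resid_underfit ** 2)))
print('All-rows baseline RMSE (predict the mean):',
      np.sqrt(np.mean((df.y_noisy - df.y_noisy.mean()) ** 2)))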


In [17]:
fig = plt.figure(figsize = (12, 8))
plt.boxplot([df.y_noisy, df_predicted_underfit.y_predicted], 
            labels = ['actual','predicted-underfit'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('y')
plt.grid(True)



In [18]:
df.y_noisy.describe()


Out[18]:
count     260.000000
mean      492.434283
std       478.849813
min      -112.575294
25%        77.826912
50%       327.241317
75%       874.702202
max      1664.910364
Name: y_noisy, dtype: float64

In [19]:
df_predicted_underfit.y_predicted.describe()


Out[19]:
count     260.000000
mean      662.497185
std       409.042715
min       -40.808170
25%       301.881100
50%       675.825400
75%      1036.082500
max      1354.002000
Name: y_predicted, dtype: float64

In [20]:
df_predicted_normal = pd.read_csv(os.path.join(data_path,'output_normal',
                                               'bp-In6EUvWaCw2-quadratic_example_test_all_normal.csv.gz'))
df_predicted_normal.columns = ["Row", "y_predicted"]

In [21]:
fig = plt.figure(figsize = (12, 8))
plt.scatter(x = df.x,
            y = df.y_noisy,
            color = 'b',
            label = 'actual', 
            marker ='+')
plt.scatter(x = df.x,
            y = df_predicted_underfit.y_predicted,
            color = 'g',
            label = 'Fit (x)',
            marker = '^')
plt.scatter(x = df.x ,
            y = df_predicted_normal.y_predicted ,
            color = 'r',
            label = 'Fit (x,x^2)')
plt.title('Quadratic - normal fit')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('Target Attribute')
#plt.legend()


Out[21]:
<matplotlib.text.Text at 0x256c859e1d0>

Test 1: Training RMSE: 385.16, Evaluation RMSE: 257.89, Baseline RMSE: 437.31

Test 2: Training RMSE: 132.20, Evaluation RMSE: 63.68, Baseline RMSE: 437.31

Test 2's RMSE is much lower than the baseline. Note that the target includes Gaussian noise with a standard deviation of 50 (np.random.randn(samples) * 50), so an evaluation RMSE in the neighborhood of 50 is close to the best achievable.
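
The baseline RMSE is what a naive model that always predicts the mean of the target would score, which is why beating it by a wide margin matters. A minimal local illustration (computed over all 260 rows here, so it will not equal the 437.31 reported on the evaluation split):

# Hedged sketch: mean-only baseline vs. the (x, x^2) model, over all rows.
baseline_pred = np.full(len(df), df.y_noisy.mean())
print('Mean-predictor RMSE (all rows):',
      np.sqrt(np.mean((df.y_noisy - baseline_pred) ** 2)))
print('All-rows RMSE (normal model):',
      np.sqrt(np.mean((df.y_noisy.values - df_predicted_normal.y_predicted.values) ** 2)))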


In [22]:
fig = plt.figure(figsize = (12, 8))
plt.boxplot([df.y_noisy,df_predicted_underfit.y_predicted, df_predicted_normal.y_predicted], 
            labels = ['actual','predicted-underfit','predicted-normal'])
plt.title('Box Plot - Actual, Predicted')
plt.ylabel('y')
plt.grid(True)



In [23]:
df_predicted_underfit.head()


Out[23]:
Row y_predicted
0 0 269.2752
1 1 1177.0090
2 2 247.9033
3 3 1244.0020
4 4 642.0548

In [24]:
df_predicted_normal.head()


Out[24]:
Row y_predicted
0 0 53.94965
1 1 1201.89800
2 2 44.75586
3 3 1345.26700
4 4 346.27510

Summary

  1. Underfitting occurs when the model does not accurately capture the relationship between the features and the target.
  2. Underfitting causes large training and evaluation errors.
    Training RMSE: 385.1816, Evaluation RMSE: 257.8979, Baseline RMSE: 437.311
  3. Evaluation Summary - the prediction over-estimation and under-estimation histogram in the AWS ML console provides important clues about how the model is behaving. Ideally, under-estimation and over-estimation should be balanced and centered around 0 (a local version of this view is sketched after this list).
  4. The box plots also highlight distribution differences between the predicted and actual values.
  5. To address underfitting, add higher-order polynomial terms or other relevant features that capture the complex relationship.
    Training RMSE: 132.2032, Evaluation RMSE: 63.6847, Baseline RMSE: 437.311
  6. When working with datasets containing hundreds or even thousands of features, it is important to rely on these metrics and distributions to gain insight into model performance.
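
As a local stand-in for the console's over/under-estimation histogram mentioned in point 3 (using the DataFrames already loaded above), one option is:

# Hedged sketch: residual (actual - predicted) histograms for both models.
# Balanced, zero-centered residuals suggest no systematic over- or under-estimation.
fig = plt.figure(figsize = (12, 8))
plt.hist(df.y_noisy.values - df_predicted_underfit.y_predicted.values,
         bins = 30, alpha = 0.5, label = 'underfit (x)')
plt.hist(df.y_noisy.values - df_predicted_normal.y_predicted.values,
         bins = 30, alpha = 0.5, label = 'normal (x, x^2)')
plt.axvline(0, color = 'k')
plt.title('Residual Histogram - Underfit vs Normal')
plt.xlabel('residual (actual - predicted)')
plt.ylabel('count')
plt.grid(True)
plt.legend()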